Let’s first plot the histogram of fixed acidity.
Fixed acidity seems to follow a roughly normal distribution. Let’s look at the distribution of volatile acidity.
Volatile acidity looks closer to a normal distribution after taking the log of the values.
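As a rough illustration of the log transformation, here is a minimal base R sketch. The values are simulated stand-ins rather than the real volatile.acidity column, and `sample_skewness` is a hypothetical helper I define only to compare the two scales.

```r
# Simulated right-skewed values standing in for volatile.acidity
# (an assumption for illustration, not taken from the data set).
set.seed(1)
volatile_acidity <- rlnorm(500, meanlog = log(0.5), sdlog = 0.35)

# Hypothetical helper: moment-based sample skewness (about 0 for symmetric data).
sample_skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# Histograms on the raw scale and on the log10 scale, side by side.
op <- par(mfrow = c(1, 2))
hist(volatile_acidity, breaks = 30,
     main = "Raw scale", xlab = "volatile acidity")
hist(log10(volatile_acidity), breaks = 30,
     main = "log10 scale", xlab = "log10(volatile acidity)")
par(op)

# Compare skewness before and after the transformation.
skew_raw <- sample_skewness(volatile_acidity)
skew_log <- sample_skewness(log10(volatile_acidity))
c(raw = skew_raw, log10 = skew_log)
```

If the log-scale skewness is much closer to zero than the raw one, that supports reading the transformed histogram as approximately normal.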
Let’s look at the distributions of more features.
From the above plots, the following observations are made:
The histogram is highly skewed to the left.
Quality ranges from 3 to 8. Most wines exhibit medium (5 to 6) quality.
Most of the wines fall in the range of 4 to 6 in terms of quality.
There are 1,599 red wines in this data set with 12 features (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, quality).
The main features in the data set are alcohol and quality. I’d like to determine which features are best for predicting the quality of wine. I think alcohol, the quantity of SO2 (free and total) and acidity (both fixed and volatile) might be useful in a predictive model of wine quality.
Alcohol, quantity of SO2 (free and total) and acidity (both fixed and volatile).
Yes, quality.level is a variable added to the data set that sorts the samples into 3 quality bins: (0,4], (4,6] and (6,10].
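One way to build such a variable in base R is `cut()`. This is only a sketch of how the bins could be produced; the `quality` vector below just lists the scores that occur in the data (3 to 8), and the exact call used in the analysis is an assumption.

```r
# The quality scores that occur in the data set (3 to 8).
quality <- c(3, 4, 5, 6, 7, 8)

# cut() uses right-closed intervals by default, which gives
# the bins (0,4], (4,6] and (6,10].
quality.level <- cut(quality, breaks = c(0, 4, 6, 10))

# Show which bin each score falls into.
data.frame(quality, quality.level)
```

Scores 3 and 4 land in (0,4], 5 and 6 in (4,6], and 7 and 8 in (6,10].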
According to the plots above, there are outliers in some of the features, such as SO2 (free and total) and acidity (fixed and volatile). Also, the distribution of volatile acidity appears to be bimodal, but after a log transformation it looks approximately normal.
No
From the correlation matrix, the following behaviors are observed:
1. Fixed acidity shows a significant negative correlation with pH and volatile acidity.
2. Volatile acidity is highly negatively correlated with citric acid and quality.
3. Free SO2 shows a significant positive correlation with total SO2.
4. Density shows a significant negative correlation with alcohol and pH, and a positive correlation with acidity (fixed and citric acid).
Also, from the scatterplot matrix above, chlorides and sulphates do not seem to have any effect on quality.
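For reference, a correlation matrix like the one discussed above can be computed with base R's `cor()`. The sketch below uses a small simulated data frame (`wine_toy`, an assumed name with made-up values) in which pH is constructed to fall as fixed acidity rises, so it only illustrates the mechanics, not the real coefficients.

```r
# Simulated stand-in data; the relationship between fixed.acidity and pH
# is built in by construction and is not estimated from the real data set.
set.seed(42)
n <- 200
fixed.acidity <- rnorm(n, mean = 8, sd = 1.5)
pH <- 3.9 - 0.07 * fixed.acidity + rnorm(n, mean = 0, sd = 0.1)
alcohol <- rnorm(n, mean = 10.4, sd = 1)
wine_toy <- data.frame(fixed.acidity, pH, alcohol)

# Pearson correlation matrix, rounded for readability.
cor_mat <- round(cor(wine_toy, method = "pearson"), 2)
cor_mat

# Significance test for a single pair of variables.
acid_ph_test <- cor.test(wine_toy$fixed.acidity, wine_toy$pH)
acid_ph_test$estimate
acid_ph_test$p.value
```

On the real data, the same two calls applied to the numeric columns would give the coefficients that the observations above refer to.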
Let’s have some box plots with quality level to observe the outliers.
For pH, most of the outliers seem to lie in the quality range (4,6].
For alcohol, most of the outliers also seem to lie in the quality range (4,6].
Only a few outliers are observed for citric acid.
SO2 has outliers at every quality level.
Chlorides and sulphates do not exhibit any significant relationship with any other features. Also, most of the outliers are in the quality range (4,6], which is not good for the prediction models.
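To count outliers per quality level rather than read them off the plots, the box plot rule (points beyond 1.5 times the hinge spread) can be applied with `boxplot.stats()`. This is a sketch on a tiny made-up data frame (`wine_toy` and its pH values are assumptions for illustration only).

```r
# Made-up pH values for illustration; not taken from the data set.
levels_3 <- c("(0,4]", "(4,6]", "(6,10]")
wine_toy <- data.frame(
  quality.level = factor(rep(levels_3, times = c(5, 9, 5)), levels = levels_3),
  pH = c(3.30, 3.35, 3.40, 3.45, 3.50,
         3.20, 3.25, 3.30, 3.30, 3.35, 3.35, 3.40, 3.45, 4.00,
         3.30, 3.35, 3.40, 3.45, 3.50)
)

# Box plot of pH by quality level, as in the plots above.
boxplot(pH ~ quality.level, data = wine_toy,
        xlab = "quality level", ylab = "pH")

# Points flagged by the same 1.5 * hinge-spread rule the box plot uses.
out_by_level <- lapply(split(wine_toy$pH, wine_toy$quality.level),
                       function(x) boxplot.stats(x)$out)
lengths(out_by_level)
```

In this toy example only the (4,6] group has a flagged point (pH 4.00).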
Let’s now dig deeper into the correlation between quality and other features:
There seems to be no significant bias in the alcohol content, with the exception that some samples with higher alcohol content show a higher density reading for quality levels 3 and 5.
The negative correlation between volatile acidity and quality is summarized below:
It seems that wines with higher volatile acidity show a higher density for quality levels 5, 7 and 8.
Let’s look at the relationship between residual sugar and quality.
Every quality rating shows a high density of residual sugar (quality 3 is a little lower), but no significant pattern is observed, so residual sugar wouldn’t help predict quality.
##   quality.level Mean_Alcohol Median_Alcohol
## 1         (0,4]     10.21587           10.0
## 2         (4,6]     10.25272           10.0
## 3        (6,10]     11.51805           11.6
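A summary table of this shape can be produced in base R with `tapply()`. The sketch below runs on a few made-up rows (`wine_toy` is an assumed name), so its numbers differ from the table above; the code actually used to build that table is not shown in the report.

```r
# Made-up rows for illustration; not taken from the data set.
wine_toy <- data.frame(
  quality = c(3, 4, 5, 5, 6, 6, 7, 8),
  alcohol = c(9.8, 10.2, 9.5, 10.0, 10.5, 11.0, 11.5, 12.0)
)
wine_toy$quality.level <- cut(wine_toy$quality, breaks = c(0, 4, 6, 10))

# Mean and median alcohol within each quality level.
means   <- tapply(wine_toy$alcohol, wine_toy$quality.level, mean)
medians <- tapply(wine_toy$alcohol, wine_toy$quality.level, median)

alcohol_by_level <- data.frame(
  quality.level  = names(means),
  Mean_Alcohol   = as.numeric(means),
  Median_Alcohol = as.numeric(medians)
)
alcohol_by_level
```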
Good wines are concentrated where citric acid is above 0.3 and alcohol is above 10.5. That is, wines with sufficiently high levels of both tend to have higher quality.
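The threshold observation can be checked numerically by comparing the share of good wines inside and outside that region. This is a sketch on made-up rows; treating "good" as quality 7 or higher (the (6,10] bin) is my assumption, and `wine_toy` is an assumed name.

```r
# Made-up rows for illustration; not taken from the data set.
wine_toy <- data.frame(
  citric.acid = c(0.10, 0.45, 0.35, 0.20, 0.50, 0.05, 0.40, 0.31),
  alcohol     = c(9.5, 11.2, 10.8, 12.0, 10.0, 9.8, 11.5, 10.6),
  quality     = c(5, 7, 7, 6, 5, 4, 8, 6)
)

# Wines that satisfy both thresholds from the observation.
meets_rule <- wine_toy$citric.acid > 0.3 & wine_toy$alcohol > 10.5

# Assumption: "good" means quality of 7 or higher.
is_good <- wine_toy$quality >= 7

# Cross-tabulate, then compute the share of good wines in each group.
table(meets_rule, is_good)
good_share <- tapply(is_good, meets_rule, mean)
good_share
```

A clearly higher share in the `TRUE` group than in the `FALSE` group would back up the observation on the real data.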
A negative correlation is observed here. Most wine samples with quality 5 have an alcohol content below 11% by volume, while most samples with quality 7 are above 11% alcohol by volume.
Before the analysis, I thought residual sugar would play an important role in defining the quality of wine (which it does). However, because it is significant at every level of wine quality, it does not actually help me determine the quality level.
The data set contains information on 1,599 wine samples across 12 variables. In the initial phase, I worked on understanding individual variables (univariate analysis), from which I explored interesting questions and made observations. Then I explored wine quality across multiple variables (bivariate and multivariate analysis).
There are many other factors related to good wines. Many of them relate to smells and flavours rather than to the chemical properties and gustatory perceptions in our dataset. Although our variables explain part of what we see, we have also seen cases where there must be other explanations for high or low quality levels.
One of the major challenges in this analysis was the limitations of the dataset. The variable of interest, wine quality, was an integer value measured on a scale of 0 to 10. However, the vast majority of the wines (1,319 out of 1,599) received a score of 5 or 6. Only 63 wines received a score of 3 or 4, and 217 wines received a score of 7 or 8. No wines received scores of 0, 1, 9, or 10. Since the wine quality variable had such limited variability, it was difficult to assess the relationship between quality and the chemical attribute variables. Having a greater variety of quality ratings or having finer gradations in the quality ratings might have allowed for a more nuanced analysis.
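The imbalance is easier to see as shares of the total. The short base R sketch below uses only the counts quoted above; the group labels are my own.

```r
# Counts quoted in the paragraph above, grouped by score range.
quality_counts <- c("3-4" = 63, "5-6" = 1319, "7-8" = 217)

# Total number of wines and the share of each group.
total  <- sum(quality_counts)
shares <- round(quality_counts / total, 3)

total
shares
```

Roughly 82% of the wines scored 5 or 6, which leaves little data at the low and high ends of the scale.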